Harmful content


T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu

Neural Information Processing Systems

Warning: This paper contains inappropriate/harmful visual content. While widespread access to the Internet and the rapid advancement of generative models boost people's creativity and productivity, the risk of encountering inappropriate or harmful content also increases.




Who's asking? User personas and the mechanics of latent misalignment

Neural Information Processing Systems

Studies show that safety-tuned models may nevertheless divulge harmful information. In this work, we show that whether they do so depends significantly on who they are talking to, which we refer to as the user persona. In fact, we find manipulating user persona to be more effective for eliciting harmful content than certain more direct attempts to control model refusal. We study both natural language prompting and activation steering as intervention methods and show that activation steering is significantly more effective at bypassing safety filters. We shed light on the mechanics of this phenomenon by showing that even when model generations are safe, harmful content can persist in hidden representations and can be extracted by decoding from earlier layers. We also show we can predict a persona's effect on refusal given only the geometry of its steering vector. Finally, we show that certain user personas induce the model to form more charitable interpretations of otherwise dangerous queries.
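To make the intervention concrete, here is a minimal Python sketch of activation steering in the spirit described above: a precomputed persona direction is added to the residual stream at a single transformer layer during generation via a forward hook. The model name, layer index, steering strength, and the random stand-in for the persona vector are illustrative assumptions, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies safety-tuned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6    # layer at which to intervene (assumption)
alpha = 4.0      # steering strength (assumption)
persona_vec = torch.randn(model.config.hidden_size)  # stand-in for a learned persona direction
persona_vec = persona_vec / persona_vec.norm()

def add_persona(module, inputs, output):
    # A GPT-2 block returns a tuple whose first element is the hidden states;
    # shift them along the persona direction and pass the rest through unchanged.
    hidden = output[0] + alpha * persona_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_persona)
ids = tok("Tell me about yourself.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))

Decoding from earlier layers (the "hidden harmful content" finding) would similarly hook intermediate layers and project their hidden states through the output head; that part is omitted here for brevity.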


Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature

Neural Information Processing Systems

Text watermarks for large language models (LLMs) have been commonly used to identify the origins of machine-generated content, which is promising for assessing liability when combating deepfake or harmful content. While existing watermarking techniques typically prioritize robustness against removal attacks, unfortunately, they are vulnerable to spoofing attacks: malicious actors can subtly alter the meanings of LLM-generated responses or even forge harmful content, potentially misattributing blame to the LLM developer. To overcome this, we introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks (mitigating spoofing attacks) as well as a coarse-grained signal to trace text sources when the signature is invalid (enhancing detectability) via a novel rank-based sampling strategy. Compared to conventional watermark detectors that only output binary results, Bileve can differentiate 5 scenarios during detection, reliably tracing text provenance and regulating LLMs. The experiments conducted on OPT-1.3B and LLaMA-7B demonstrate the effectiveness of Bileve in defeating spoofing attacks with enhanced detectability.
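As a rough, hypothetical illustration of how a bi-level detector of this kind might combine its two signals, the Python sketch below checks a fine-grained signature first and falls back to a coarse-grained watermark statistic when the signature fails. The verdict labels collapse the outcomes into three illustrative categories rather than Bileve's actual five scenarios, and the threshold is arbitrary.

def classify_provenance(signature_valid: bool, coarse_score: float,
                        threshold: float = 2.0) -> str:
    # Labels are illustrative stand-ins, not the paper's scenario taxonomy.
    if signature_valid:
        return "intact machine-generated text (signature verifies)"
    if coarse_score > threshold:
        return "edited or forged machine-generated text (signature broken, watermark present)"
    return "no watermark evidence (plausibly human-written)"

print(classify_provenance(signature_valid=False, coarse_score=3.1))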


Online gaming escaped Australia's social media ban - but critics say it's just as addictive

BBC News

Wednesday afternoons have become a ritual for 15-year-old Sadmir Perviz. It's a circuitous route from home in Perth to the Fiona Stanley Hospital - but it's worth it, he says, to sit down for a game of Dungeons & Dragons with people he may not know but with whom he shares a great deal in common. Sadmir and his board game companions are just some of the 300 patients at the gaming disorder clinic, Australia's only publicly run institution of its type, helping patients wean themselves off excessive online gaming habits. The room where they meet is a simple space in a faceless hospital, but in the corner there's a pile of board games on a chair. Jenga, Uno and Sushi Go are also popular choices at the informal group, which is attended by both patients and clinicians.


SynBullying: A Multi-LLM Synthetic Conversational Dataset for Cyberbullying Detection

Kazemi, Arefeh, Qadeer, Hamza, Wagner, Joachim, Hosseini, Hossein, Kalaivendan, Sri Balaaji Natarajan, Davis, Brian

arXiv.org Artificial Intelligence

We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across six dimensions: conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.
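A minimal Python sketch of the augmentation use case mentioned at the end of the abstract: train a cyberbullying classifier on real data alone, then on real plus synthetic data, and compare F1 on a held-out set. The tiny inline texts, labels, and the TF-IDF/logistic-regression pipeline are stand-ins for illustration; the released dataset's format and the authors' classifiers may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy stand-ins for real and synthetic conversations (1 = bullying, 0 = benign).
real_texts = ["you are so dumb, nobody likes you", "see you at practice tomorrow"]
real_labels = [1, 0]
syn_texts = ["stop posting, everyone thinks you're pathetic", "great game last night!"]
syn_labels = [1, 0]
test_texts = ["quit whining, loser", "happy birthday!"]
test_labels = [1, 0]

def train_and_eval(texts, labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return f1_score(test_labels, clf.predict(test_texts))

print("real only:        F1 =", train_and_eval(real_texts, real_labels))
print("real + synthetic: F1 =", train_and_eval(real_texts + syn_texts, real_labels + syn_labels))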


Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models

Young, Richard

arXiv.org Artificial Intelligence

Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and whether model scale or inference mode affects robustness is unknown. This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors, generating over 97,000 API queries across adversarial conversations with automated evaluation by independent safety classifiers. Results demonstrated a spectrum of vulnerability: six models achieved 96% to 100% attack success rate (ASR), while four showed meaningful resistance, with ASR ranging from 42% to 78%; enabling extended reasoning on identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.
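For reference, the headline metric here is straightforward: attack success rate (ASR) is the fraction of targeted behaviors for which the independent safety classifier judges the multi-turn attack to have succeeded. The Python sketch below computes per-model ASR from such verdicts; the tuples are made-up placeholders, not the study's data.

from collections import defaultdict

# (model, behavior_id, judged_successful) records from an evaluation harness (illustrative).
verdicts = [
    ("model-a", 0, True), ("model-a", 1, True), ("model-a", 2, False),
    ("model-b", 0, False), ("model-b", 1, True), ("model-b", 2, False),
]

per_model = defaultdict(list)
for model, _behavior, success in verdicts:
    per_model[model].append(success)

for model, outcomes in per_model.items():
    asr = 100.0 * sum(outcomes) / len(outcomes)
    print(f"{model}: ASR = {asr:.1f}% over {len(outcomes)} behaviors")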


When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI

Li, Yanhui, Zhou, Qi, Xu, Zhihong, Guo, Huizhong, Wang, Wenhai, Wang, Dongxia

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLMs' ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.
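A hypothetical Python sketch of the benchmark's core evaluation loop: show each image-text post to an LVLM, ask a yes/no harmfulness question, and score accuracy against gold labels. The prompt wording, sample fields, and the ask_lvlm callback are assumptions for illustration, not the released benchmark code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    image_path: str
    gold_is_harmful: bool

PROMPT = ("Does this post contain harmful content "
          "(e.g., hateful or violent text hidden in the image)? Answer yes or no.")

def evaluate(samples: list[Sample], ask_lvlm: Callable[[str, str], str]) -> float:
    # ask_lvlm(image_path, prompt) -> free-text answer from the model under test.
    correct = 0
    for s in samples:
        answer = ask_lvlm(s.image_path, PROMPT).strip().lower()
        predicted_harmful = answer.startswith("yes")
        correct += predicted_harmful == s.gold_is_harmful
    return correct / len(samples)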


GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision

Xiang, Yuxiao, Chen, Junchi, Jin, Zhenchao, Miao, Changtao, Yuan, Haojie, Chu, Qi, Gong, Tao, Yu, Nenghai

arXiv.org Artificial Intelligence

Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The code will be made publicly available.
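To make the evaluation target concrete, here is a small Python sketch (with assumed field names) of auditing the full Question-Thinking-Answer trace rather than only the question and final answer, and scoring the auditor with F1 on the unsafe class. The is_unsafe callback stands in for GuardTrace-VL itself; its joint image-text analysis is only hinted at via the image path.

from dataclasses import dataclass
from typing import Callable
from sklearn.metrics import f1_score

@dataclass
class QTATrace:
    image_path: str
    question: str
    thinking: str
    answer: str
    gold_unsafe: bool

def audit_traces(traces: list[QTATrace], is_unsafe: Callable[[QTATrace], bool]) -> float:
    # Run the auditor over every full trace and return F1 on the unsafe class.
    gold = [t.gold_unsafe for t in traces]
    pred = [is_unsafe(t) for t in traces]
    return f1_score(gold, pred)

# An answer-only baseline (illustrative keyword check) would miss harm confined to
# the reasoning trace; an auditor that also reads the thinking field can flag it.
def answer_only_baseline(t: QTATrace) -> bool:
    return "explosive" in t.answer.lower()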